home *** CD-ROM | disk | FTP | other *** search
Text File | 1996-07-05 | 15.8 KB | 472 lines | [TEXT/R*ch] |
- fastDNAml 1.0
-
-
- Gary J. Olsen, Department of Microbiology
- University of Illinois, Urbana, IL
- gary@phylo.life.uiuc.edu
-
- Hideo Matsuda, Mathematics and Computer Science
- Argonne National Laboratory, Argonne, IL
- matsuda@mcs.anl.gov
-
- Ray Hagstrom, Physics
- Argonne National Laboratory, Argonne, IL
- hagstrom@mcs.anl.gov
-
- Ross Overbeek, Mathematics and Computer Science
- Argonne National Laboratory, Argonne, IL
- overbeek@mcs.anl.gov
-
-
-
- What is fastDNAml
-
- fastDNAml is a program derived from Joseph Felsenstein's version 3.3 DNAML
- (part of his PHYLIP package). Users should consult the documentation for
- DNAML before using this program.
-
- fastDNAml is an attempt to solve the same problem as DNAML, but to do so
- faster and using less memory, so that larger trees and/or more bootstrap
- replicates become tractable. Much of fastDNAml is merely a recoding of the
- PHYLIP 3.3 DNAML program from PASCAL to C.
-
- DNAML includes the following notice:
-
- version 3.3. (c) Copyright 1986, 1990 by the University of Washington and
- Joseph Felsenstein. Written by Joseph Felsenstein. Permission is granted to
- copy and use this program provided no fee is charged for it and provided that
- this copyright notice is not removed.
-
-
-
- Why is fastDNAml faster?
-
- Some recomputation of values has been eliminated (Joe Felsenstein has done
- much of this in version 3.4 DNAML).
-
- The optimization of branch lengths has been accelerated by changing from an EM
- method to Newton's method.
-
- The strategy for simultaneously optimizing all of the branches on the tree has
- been modified to spend less time getting an individual branch right before
- improving the other branches.
-
-
-
- Other new features in fastDNAml
-
- fastDNAml includes a checkpoint feature to regularly save its progress toward
- finding a large tree. If the program is interrupted, a minor change to the
- input file and adding the R (restart) option permits the work to be resumed
- from the last checkpoint.
-
- The new R {restart) option can also be used for more rapid addition of new
- sequences to a previously computed tree (when new sequences are added to the
- alignment, it is best if the relative alignment of the previous sequences is
- not altered).
-
- The G (global) option has been generalized to permit crossing any number of
- branches during tree rearrangements. In addition, it is possible to modify
- the extent of rearrangement explored during the sequential addition phase of
- tree building.
-
- The G U (global and user tree) option combination instructs the program to
- find the best of the user trees, and then look for rearrangements that are
- better still.
-
- The number of available rate categories has been raised from 9 to 35.
-
- The weighting mask accepts values from 0 through 35.
-
- The new B (bootstrap) option causes generation of a bootstrap sample, drawn
- from the input data.
-
- The program includes "P4" code for distributing the problem over multiple
- processors (either within one machine, or across multiple machines).
-
-
-
- Do DNAML and fastDNAml give the same answer?
-
- This is not yet fully established.
-
- The likelihoods and branch lengths sometimes differ very slightly due to
- different criteria for stopping the optimization process.
-
- An earlier version of fastDNAml had an error in generating tree rearrangements
- in the search for better trees. This did not affect the default (local)
- rearrangements, but could have caused the program to miss some global
- rearrangements. We think that it is now correct, but it is one of the most
- difficult program features to test.
-
- Little has been done to check the confidence limits on branch lengths.
-
- If you are concerned, you can supply a tree inferred by fastDNAml as a user
- tree to PHYLIP DNAML and let it (1) reoptimize branch lengths, (2) tell you
- the confidence limits and (3) tell you the tree likelihood. (It may be
- necessary to remove the quotation marks around the species names in the
- treefile.)
-
-
-
- Features in the works
-
- Test subtree exchanges (as well as moving a single subtree) in the search for
- better trees.
-
- More quickly evaluating whether a tree is a good candidate for best tree.
-
- Allowing the program to optimize any user-defined subset of branches when user
- lengths are supplied.
-
- Maintaining a list of the several best trees, not just the (single) best.
-
-
-
- Input and Options
-
-
- Basics
-
- The input to fastDNAml is similar to that used by DNAML (and the other PHYLIP
- programs). The user should consult the PHYLIP documentation for a basic
- description of the format.
-
- This version of fastDNAml expects to get its input from stdin (standard input)
- and writes its output to stdout (standard output). (There are compile time
- options to modify this, for those who care to get into such things.)
-
- On a UNIX system, it is a simple matter to redirect input from a file and
- output to a file:
-
- fastDNAml < infile > outfile
-
- On a VMS system it is only slightly more difficult. Immediately before
- running the program, one includes two commands that define the input and
- output files:
-
- $ Define/User Sys$Input infile
- $ Define/User Sys$Output outfile
- $ Run fastDNAml
-
- The default input data format is Interleaved (see I option). To help get data
- from a GenBank or similar format, numbers in the sequence data (i.e., sequence
- position numbers) are ignored.
-
- (Note that the program also writes a file called checkpoint.PID. See the R
- option below for more description.)
-
-
- 1 -- Print Data
-
- By default, fastDNAml 1.0 does not echo the sequence data to the output file.
- Option 1 reverses this.
-
-
- 3 -- Do Not Print Tree
-
- By default, fastDNAml 1.0 prints the final tree to the output file. Option 3
- reverses this.
-
-
- 4 -- Write Tree to File
-
- By default, fastDNAml 1.0 does not write a machine readable (Newick format)
- copy of the final tree to an output file. Option 4 reverses this. The tree
- output file will be called treefile.PID (where PID is the process ID under
- which fastDNAml is running).
-
-
- B -- Bootstrap
-
- Generates a bootstrap sample of the input data. Requires auxiliary data line
- of the form:
-
- B random_number_seed
-
- If the W option is used, only positions that have nonzero weights are used in
- computing the bootstrap sample. Warning: For a given random number seed, the
- sample will always be the same.
-
- PHYLIP DNAML does not include a bootstrap option. (Use the DNABOOT program.)
-
-
- C -- Categories
-
- Requires auxiliary data of the form:
-
- C number_of_categories list_of_category_rates
-
- The maximum number of categories is 35. This line is followed by a list of
- the rates for each site:
-
- Categories list_of_categories [per site, one or more lines]
-
- Category "numbers" are ordered: 1, 2, 3, ..., 9, A, B, ..., Y, Z. Category
- zero (undefined rate) is permitted at sites with a zero in a user-supplied
- weighting mask.
-
- PHYLIP DNAML is limited to categories 1 through 9. Also, in PHYLIP version
- 3.3, the categories data came after all the other auxiliary data, but before
- the user-supplied base frequencies and sequence data. If you make the C line
- your last auxiliary data line, the programs will behave the same.
-
-
- F -- Empirical Frequencies
-
- Instructs the program to use empirical base frequencies derived from the
- sequence data. Therefore the input file should not include a base frequencies
- line preceding the data.
-
-
- G -- Global
-
- If the global option is specified, there may also be an [optional] auxiliary
- data line of form:
-
- G N1
-
- or
-
- G N1 N2
-
- N1 is the number of branches to cross in rearrangements of the completed tree.
- The value of N2 is the number of branches to cross in testing rearrangements
- during the sequential addition phase of tree inference.
-
- N1 = 1: local rearrangement (default without G option)
-
- 1 < N1 < numsp-3: regional rearrangements (crossing N1 branches)
-
- N1>= numsp-3: global rearrangements (default with G option)
-
-
-
- N2 <= N1 the default N2 is 1, local rearrangements.
-
- The G option can also be used to force branch swapping on user trees, that is,
- a combination of G and U options.
-
- If the auxiliary line is supplied, it cannot be the last line of auxiliary
- data. (It may be necessary to add the T option with an auxiliary data line of
-
- T 2.0
-
- if no other auxiliary data are used.)
-
- PHYLIP DNAML does not support the auxiliary data line or branch swapping on a
- user tree.
-
-
- I -- Not Interleaved
-
- By default, fastDNAml 1.0 expects data lines for the various sequences in an
- interleaved format (as did PHYLIP 3.3 DNAML). The I option reverses the
- expected format (to non-interleaved data, in which all the data lines for one
- sequence before the next sequence begins). This is particularly useful for
- editing a GenBank or equivalent format into a valid input file (note that
- numbers within the sequence data are ignored, so it is not necessary to remove
- them).
-
- If all the data for each sequence are on one line, then the interleaved and
- non-interleaved formats are degenerate. (This is the way David Swofford's
- PAUP program writes PHYLIP format output files.) The drawback is that many
- programs do not handle long lines of text. This includes the vi and EDT text
- editors, many electronic mail programs, and some versions of FTP for VAX/VMS
- systems.
-
- PHYLIP 3.3 DNAML expects interleaved data, and does not include an I option to
- alter this. PHYLIP 3.4 DNAML accepts an I option, but the default format is
- reversed.
-
-
- J -- Jumble
-
- Randomize the sequence addition order. Requires an auxiliary input line of
- the form:
-
- J random_number_seed
-
- Note that fastDNAml explores a very small number of alternative tree
- topologies relative to a typical parsimony program. There is a very real
- chance that the search procedure will not find the tree topology with the
- highest likelihood. Altering the order of taxon addition and comparing the
- trees found is a fairly efficient method for testing convergence. Typically,
- it would be nice to find the same best tree at least twice (if not three
- times), as opposed to simply performing some fixed number of jumbles and
- hoping that at least one of them will be the optimum.
-
-
- L -- User Lengths
-
- Causes user trees to be read with branch lengths (and it is an error to omit
- any of them). Without the L option, branch lengths in user trees are not
- required, and are ignored if present.
-
-
- O -- Outgroup
-
- Use the specified sequence number for the outgroup. Requires an auxiliary
- data line of the form:
-
- O outgroup_number
-
- This option only affects the way the tree is drawn (and written to the
- treefile).
-
-
- Q -- Quickadd
-
- This option greatly decreases the time in initially placing a new sequence in
- the growing tree (but does not change the time required to subsequently test
- rearrangements). The overall time savings seems to be about 30%, based on a
- very limited number of test cases. Its downside, if any, is unknown. This
- will probably become default program behavior in the near future.
-
- If the analysis is run with a global option of "G 0 0", so that no
- rearrangements are permitted, the tree is build very approximately, but very
- quickly. This may be of greatest interest if the question is, "Where does
- this one new sequence fit into this known tree? The known tree is provided
- with the restart option, below.
-
- PHYLIP DNAML does not include anything comparable to the quickadd option.
-
-
- R -- Restart
-
- The R option causes the program to read a user-supplied tree with less than
- the full number of taxa as the starting point for sequential addition of the
- remaining taxa. Thus, the sequence data must be followed by a valid (Newick
- format) tree. (The phylip_tree/2, prolog fact format, is now also supported.)
-
- The restart option can also be used to increase the range of the search for
- alternative (better) trees. For example, you can take a tree produced with
- only "local" tree rearrangements, and increase the rearrangements to
- "regional" or "global" by combining the appropriate global option with the
- restart option. If the starting tree was written by fastDNAml, then the
- extent of rearrangements is saved with the tree, and will be used as the
- starting point for the additional search. If the tree was already globally
- optimized, then no additional searching will be performed.
-
- To support the R option, after each taxon is added to the growing tree, and
- after each round of rearrangements, the program appends a checkpoint tree to a
- file called checkpoint.PID, where PID is the process number of the running
- fastDNAml program. The last line of this file needs to be appended to the
- input file when the R option is used. (This should not be confused with the U
- (user tree) option, which expects a number followed by that number of trees.
- No additional taxa are added to user trees.)
-
- The UNIX utility tail can be used to remove the last tree from the checkpoint
- file, and the utility cat can be used to append it to the input. For example,
- the following script can be used to add a starting tree and the R option to a
- data file, and restart fastDNAml:
-
- #! /bin/sh
- if test $# -ne 1
- then echo "Usage: restart checkpoint_file"
- exit
- fi
- read first_line # first line of data file
- echo "$first_line R" # add restart option
- cat - # rest of data file
- tail -1 $1 # append last tree in checkpoint file
-
- If this shell script is in the file called restart, then one might use the
- command:
-
- restart checkpoint.21312 < infile | fastDNAml > new_outfile
- ^script ^checkpoint tree ^data ^dnaml program ^output_file
-
- If this is too opaque, don't worry about it, or talk with your local unix
- wizard. In the mean time, this and other useful shell scripts are provided
- with the program.
-
- PHYLIP DNAML does not write checkpoint trees and does not have a restart
- option.
-
-
- T -- Transition/transversion ratio
-
- Use a user-specified ratio of transition to transversion type substitutions.
- Without the T option, a value of 2.0 is used. Requires an auxiliary data line
- of the form:
-
- T ratio
-
- (Note that a T option with a
-
- T 2.0
-
- auxiliary data line does nothing, but can provide a last auxiliary data line
- following optional auxiliary data, such as
-
- G 3 2
-
- or
-
- Y 2
-
- that cannot be last.)
-
-
- U -- User Tree(s)
-
- Read an input line with the number of user-specified trees, followed by the
- specified number of trees. These data immediately follow the sequence data.
-
- The trees must be in Newick format, and terminated with a semicolon. (The
- program also accepts a pseudo_newick format, which is a valid prolog fact.)
-
- The tree reader in this program is more powerful than that in PHYLIP 3.3. In
- particular, material enclosed in square brackets, [ like this ], is ignored as
- comments; taxa names can be wrapped in single quotation marks to support the
- inclusion of characters that would otherwise end the name (i.e., '(', ')',
- ':', ';', '[', ']', ',' and ' '); names of internal nodes are properly
- ignored; and exponential notation (such as 1.0E-6) for branch lengths is
- supported.
-
-
- W -- Weights
-
- Read user-specified column weighting information. This option requires
- auxiliary data of the form:
-
- Weights list_of_weight_values [per site, one or more lines]
-
- It is necessary that the weight values not start before the 11'th character in
- the line, or some of them will be lost. Weights from 0 to 35 are indicated by
- the series: 0, 1, 2, 3, ..., 9, A, B, ..., Y, Z.
-
- PHYLIP DNAML does not support user weights with values other than 1 or 0.
- This limit has been removed in fastDNAml 1.0 to permit the use of user weights
- as a mechanism for representing a bootstrap sample (that is, only the
- auxiliary data lines change, not the body of the data file).
-
-
- Y -- Write Tree
-
- Output final tree to an output file called treefile.PID. By default the tree
- is in Newick format.
-
- If an auxiliary input line of the form
-
- Y 2
-
- is also included, then the tree output file is written as a prolog fact:
-
- pseudo_newick([Comment], (Subtree1, Subtree2, Subtree3):Length).
-
- where each subtree is either
-
- (Subtree1,Subtree2):Length
-
- or
-
- Label:Length
-
- Because this auxiliary input line is optional, it cannot be the last auxiliary
- data line.
-
- PHYLIP DNAML does not append the PID (process ID) to the tree file name and
- does not support the prolog format output.
-